\RequirePackage{natbib}

% Add all necessary packages here
\usepackage{amsmath,amssymb,epsfig,tabularx,wrapfig}
\usepackage{graphicx}
\usepackage{color,soul}
\usepackage{verbatim}
\usepackage{ifthen}
\usepackage{psfrag}
\usepackage{siunitx}
\usepackage{enumerate}
\usepackage{hyperref}
\usepackage{setspace}

% Formatting
\setlength{\topmargin}{-.5in}
\setlength{\textheight}{9in}
\setlength{\oddsidemargin}{.1in}
\setlength{\evensidemargin}{-.35in}
\setlength{\textwidth}{6in}
\font\heada=cmbx10 scaled\magstep3
\font\headb=cmsl10 scaled\magstep1
\font\headc=cmr8
\pretolerance=10000
\raggedright
\setlength{\parindent}{2em}

% Define new commands
\newcommand{\bm}{\mathbf}
\newcommand{\PR}{\mathrm{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\todo}[1]{\textcolor{red}{TODO: #1}} % \textcolor provided by the color package

% Identifying information
\newcommand{\myname}{Ben DeVries}
\newcommand{\myadvisor}{Andy Hoegh}
\newcommand{\wprojcoord}{Ian Laga} % Current Writing Project Coordinator
\newcommand{\maintitle}{Tree Based Learning Models Applied to Flying Fox Areal Data}
\newcommand{\mydate}{Date of completion here as Month Day, Year}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Title page
\begin{singlespace}
\begin{titlepage}
\null
\vspace{2in}
% Enter title here
\begin{center}
{\LARGE\bfseries \maintitle} \\[.5in]
{\Large \myname \\
\vspace{0.5cm}
Department of Mathematical Sciences \\
Montana State University \\[.5in]}
\mydate \\[1in]
A writing project submitted in partial fulfillment\\
of the requirements for the degree\\[.25in]
Master of Science in Statistics
\end{center}
\end{titlepage}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Signatures
\begin{titlepage}
\null
\vspace{2in}
\begin{center}
{\bfseries\huge APPROVAL}\\[1in]
of a writing project submitted by\\[.25in]
\myname \\[1in]
\end{center}
\noindent This writing project has been read by the writing project advisor and has been found to be satisfactory regarding content, English usage, format, citations, bibliographic style, and consistency, and is ready for submission to the Statistics Faculty.
\vspace{.3in}
\begin{center}
\begin{tabular}{ll}
\rule{2.75in}{.03in} & \rule{2.75in}{.03in} \\
Date & \myadvisor \\
 & Writing Project Advisor \\
\end{tabular}
\end{center}
\vspace{1cm}
\begin{center}
\begin{tabular}{ll}
\rule{2.75in}{.03in} & \rule{2.75in}{.03in} \\
Date & \wprojcoord \\
 & Writing Projects Coordinator \\
\end{tabular}
\end{center}
\end{titlepage}
\end{singlespace}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Abstract
\vspace{2in}
\begin{abstract}
\input{abstract}
\end{abstract}
\newpage

DeVries Writing Project

1 Introduction

A flying fox is a large type of fruit bat. There are about 65 species, found on tropical islands from Madagascar to Australia and Indonesia, as well as in parts of mainland Asia. Our interest in flying foxes stems from Hendra virus, which is endemic to the four species of flying fox in Australia. This rare virus is zoonotic, meaning it can be transmitted from animals to humans. The chance of a bat transmitting Hendra virus directly to a person is low; humans infected with Hendra virus have come into contact with horses exposed to waste or bodily fluids from flying foxes. The authors of Pathogen Spillover Driven by Rapid Changes in Bat Ecology identified loss of flying fox habitat and climate as influential factors in Hendra virus spillovers. Habitat loss and climate both affect the availability of food and water, leading bats to venture into farmland. In this project we explore potential associations between land cover and flying fox feeding locations. To assess the impact of land cover on feeding locations, we apply Bayesian Additive Regression Trees (BART). A large portion of this project is devoted to learning about tree algorithms and the tools used to work with spatial data.

In traditional regression models we assume some sort of relationship between the response, variables of interest, and controlling factors. Exploratory data analysis, background information, and model experimentation can bring potential relationships to light, but many problems involve unobservable dynamics that are difficult or impossible to capture with a linear model. In some cases it is necessary to take a more algorithmic approach to modelling. Unfortunately, many models in the algorithmic framework are referred to as black boxes, indicating the inner workings of a fitted model are difficult or impossible to explain in a meaningful way. Decision tree ensembles sit within the black box framework, but the explainability of individual trees has led to variable importance and interaction measures, along with other useful tools for inference. It is still very difficult to interpret the effects of individual predictors in these models, as the hierarchical structure can make marginal effects very misleading. Still, variable importance measures combined with predictive accuracy make ensembles of trees a great tool for exploratory analysis. Our project starts with a brief review of supervised learning before discussing Classification and Regression Trees (CART) along with several models inspired by CART. Next, our attention shifts to tools for obtaining and preparing data for our analysis. We use Google Earth Engine to obtain our covariates and access the response from an archived PostgreSQL database. Finally, we apply BART to analyze the feeding locations of flying foxes and assess the predictive accuracy of our model.

2 Background

2.1 Supervised Learning

With this project, we chose to focus on tree based methods. The Elements of Statistical Learning was a key source for understanding algorithms and concepts in statistical learning. Our background discussion begins with some notation and concepts in supervised learning. We then discuss the algorithms for several models, using simulation to demonstrate their use. In supervised learning problems we work to map a set of features to a target. When discussing these components, we follow the notation used in The Elements of Statistical Learning. \(X\) denotes the features while \(Y\) and \(G\) denote the target for quantitative and categorical responses respectively. If there are \(N\) observations of \(p\) inputs, then \(\bm X\) is an \(N\times p\) matrix. The inputs for the \(i\)'th observation are \(\bm x_i\), while \(\bm x_j\) denotes the vector of all observations for the \(j\)'th input.

A common approach in supervised learning is to assume our features and target are observations of random variables with joint distribution \(\text{Pr}(Y,X)\). \(G\) would replace \(Y\) in a classification problem, and \(\bm X\) would replace \(X\) for multiple covariates. We want to find a function \(f:\mathcal{X}\rightarrow\mathcal{Y}\) that predicts \(Y\) reasonably well. To do this, we define a loss function \(L(Y,f(X))\) which penalizes the prediction error on a training set \(\boldsymbol \tau=\{(x_i,y_i)\},\ i=1,2,...,N\). The expected prediction error (EPE) is defined as \(\mathbb{E}[L(Y,f(X))]\). Our overall goal is to approximate the function \(f\) that minimizes the EPE. Using squared error loss \((L_2)\), the EPE is minimized by choosing \(f\) such that \(f(x)=\mathbb{E}[Y|X=x]\) (The Elements of Statistical Learning, p. 18). Now we need an algorithm to approximate \(f\) using \(\boldsymbol \tau\). The choice of an appropriate algorithm is heavily influenced by the availability and distribution of data. With a quantitative target, we might be able to assume an additive error relationship between \(Y\) and \(\bm X\).
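As a quick numeric illustration of this optimality (a sketch for intuition, not part of the project's analysis; the project itself used R, but stdlib Python is used here), the best constant prediction for a sample under squared error loss is the sample mean, mirroring \(f(x)=\mathbb{E}[Y|X=x]\) minimizing the EPE:

```python
# Under squared-error loss, the constant c minimizing the average loss
# for a sample is the sample mean (toy data, invented for illustration).
ys = [1.0, 2.0, 2.5, 4.0, 5.5]

def avg_sq_loss(c, ys):
    return sum((y - c) ** 2 for y in ys) / len(ys)

mean = sum(ys) / len(ys)

# Any candidate away from the mean incurs a larger average loss.
candidates = [mean - 0.5, mean, mean + 0.5]
losses = [avg_sq_loss(c, ys) for c in candidates]
assert min(losses) == avg_sq_loss(mean, ys)
```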

The additive error model is simply \(Y=f(\bm X)+\epsilon\) where \(\mathbb{E}[\epsilon]=0\) and \(\bm X\perp\epsilon\). Here we are assuming the process that generated \(Y\) can be approximated using \(\bm X\) such that the error due to ignored features and other factors has mean zero. For data generated under an additive error model, a natural approach to approximating \(f\) is to use a sum of basis functions. The additive model is defined as \(Y=\beta_0+\sum_{j=1}^pf_j(\bm x_j)+\epsilon\). In this model, each \(f_j\) could be a sum of basis functions. The major limitation of an additive model is that we assume no interactions between features. There is no reason we can't add interactions or functions of multiple covariates, but identifying these interactions is difficult. We might be able to fit a model with select transformations of covariates and every possible interaction, but this would almost certainly result in overfitting. Assuming a generating model with any form of noise implies even the optimal \(f\) will not predict every observation perfectly. A model is overfit when trends are ignored to explain slight irregularities in a sample. Discerning irregularities is highly subjective, and any decision we make based on observed data may lead to overfitting. To compare models' predictive capability, we use data that has been held out from training to obtain a sample of prediction errors. With this sample, we can use some metric (usually based on our loss function) for model validation.

There are a variety of methods to perform model validation, but the two most common are hold out and K-fold. In hold out validation, some portion of the data is held out for testing and the rest is used to fit the model. If the model contains hyperparameters, we can split the data into training, validation, and testing sets. Fitting models to the training set, we try various values for the hyperparameters and compare model performance for predictions on the validation set. After we determine the optimal hyperparameter values, we refit the model on the combined training and validation sets before using the testing set to assess the final fit. In some cases, additional tuning may be required after combining training and validation, as additional training data may change the optimal values of the hyperparameters. Using training/testing splits is generally favored for computationally intensive problems with observations to spare. The primary disadvantage of hold out validation is that less data is available for training.

In K-fold cross validation, we get to train and test using all available data by first splitting the data into \(k\) folds of size \(\approx N/k\). One fold is held out for testing and the other \(k-1\) are used for training. This is repeated until every fold has been used for testing. A metric can then be used to select the model before fitting it to the complete dataset. K-fold makes the most of the available data, but is more computationally expensive. Additionally, K-fold does not assess the final model fit, as parameter estimates will change when the model is refit. For models with low variance, representativeness of the available data is really the only concern. If there is a lot of variation in estimated parameters across folds, our cross validation becomes more of an assessment of the theoretical model as opposed to the fitted model. This can still be useful for model selection and refinement.
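The fold construction can be sketched in a few lines (illustrative stdlib Python; fold membership here is deterministic rather than shuffled, which a real analysis would usually randomize):

```python
# Minimal K-fold index generator: each of the k folds serves once as the
# test set while the remaining k-1 folds form the training set.
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]  # ~n/k points each
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Every observation appears in exactly one test fold across the k rounds.
for train, test in kfold_indices(10, 5):
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == list(range(10))
```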

So far we have discussed a general approach to problems centered around prediction. In the next subsection we explore the intuition behind tree based algorithms, beginning with Classification and Regression Trees (CART). Later, we turn our attention to tree based models fit with Bayesian methods. Our primary focus will be on random forests, developed by Breiman, and Bayesian Additive Regression Trees (BART), credited to Chipman, George, and McCulloch. Direct predecessors of these two algorithms are discussed as well.

2.2 Decision Trees

Decision trees are a type of hierarchical model where the feature space is partitioned into hyperrectangular regions, and a model is fit within each region. For regression tasks, we typically use mean only models, in which case the decision tree model is a step function. Unlike our mean only models for linear regression, we do not make any distributional assumptions. Without a distribution, we can't quantify the uncertainty in our predictions. Additionally, the binary structure of trees makes the effects of individual predictors less interpretable than in a linear model. Small changes in a feature \(x_{i,j}\) may not affect the associated prediction \(\hat y_i\). On the other hand, trees can greatly outperform linear models on more complex data. The hierarchical structure naturally handles multicollinearity, and can help us find interactions between covariates. Even though we don't believe \(\bm X\) is related to \(\bm y\) by a step function, we can gain insight into the true relationship through a step function approximation. An example of a tree model is shown in Figure 1. Every decision after the top root node is conditioned on the previous decisions, allowing us to display the hierarchy graphically.

flowchart TD
  A["$$X_1\ge a$$"]
  A --> |Yes| B["$$X_2\ge b$$"]
  A --> |No| C["$$X_3\ge c$$"]
  B --> |Yes| D("$$\mu_1$$")
  B --> |No| E("$$\mu_2$$")
  C --> |Yes| F["$$X_2\ge d$$"]
  C --> |No| G("$$\mu_3$$")
  F --> |Yes| H("$$\mu_4$$")
  F --> |No| I("$$\mu_5$$")
Figure 1: Visualization of binary decision tree

Step functions are very flexible: for any training set \(\bm X\) with target \(Y\) in which identical feature vectors always share the same target value (\(\bm x_i=\bm x_j\implies y_i=y_j\)), we can fit a tree that has no error on the training set. With this flexibility, overfitting is a major concern. We can control the overall size and shape of trees with three hyperparameters. First is the complexity parameter \(\alpha\), which scales a penalty based on the number of terminal nodes (mean only models). Next we have the minimum split and minimum bucket sizes. The minimum bucket is the smallest number of observations allowed in a terminal node. Similarly, the minimum split is the smallest number of observations allowed in one outcome of a binary decision. We generally choose the complexity parameter by growing the largest possible tree, then collapsing terminal nodes to produce a sequence of nested subtrees. Next we compare the subtrees via cross validation and choose the tree with minimal cross validated error. This process of choosing an appropriate tree size is commonly referred to as pruning. While pruning helps reduce the risk of overfitting, large regression trees are inherently unstable. We'll demonstrate this with data simulated under the model described below. The rpart function, developed by Therneau and Atkinson, is used to fit our CART model. A simulated testing data set displaying the true function, along with the fit from a tree chosen by cross validation on training data, is displayed in Figure 2. Figure 3 displays the distribution of the optimal number of splits chosen by cross validation. \[y_i=\cos(108t_i\sin(108t_i+w_i))\cdot\exp(\sin(108t_i\cos(108t_i+x_i)))+z_i\] \[W_i,X_i,Z_i\sim N(0,\sigma^2)\] \[t_i=-\frac{\pi}{18},...,\frac{\pi}{18}\] \[i=1,2,...,350\]
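The greedy split search at the heart of CART can be sketched as follows (an illustrative stdlib Python sketch of a single split on one predictor, not rpart's actual implementation): every observed cutoff is tried, and the split minimizing the summed squared error of the two resulting mean-only models is kept.

```python
# Greedy single-split search: for one predictor, choose the cutoff that
# minimizes the total SSE of the left/right mean-only models.
def best_split(x, y):
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = (float("inf"), None)
    for cut in sorted(set(x))[1:]:          # candidate cutoffs
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        total = sse(left) + sse(right)
        if total < best[0]:
            best = (total, cut)
    return best  # (sse, cutoff)

# A clean step in y is recovered exactly (toy data).
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 10, 10, 10]
sse, cut = best_split(x, y)
assert cut == 4 and sse == 0.0
```

Growing a full tree repeats this search recursively within each region, which is why deep trees can fit any consistent training set perfectly.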

Figure 2: Visualization of simulated data and predictions
Figure 3: Violin plots of optimal number of splits for simulated data

Decision trees are very unstable for this data. Across all three noise levels, the number of splits chosen by cross validation ranges from zero to over 300. The variability in the size and structure of the trees suggests we can't have much faith in predictions from an individual tree. A simpler model will be less variable, but more biased. To maintain low bias and reduce variance, we turn our attention to ensemble methods. Ensemble methods are quite prevalent in predictive modelling. By combining the predictions from several models, we can reduce the variance of predictions while capturing complex intricacies of the data.

The random forests algorithm, developed by Breiman, is a CART-based ensemble combining Bootstrap Aggregating (bagging) and Ho's random decision forests. Bagging is simply averaging the predictions of models fit to bootstrapped samples. When possible, we obtain \(\hat y_i\) by averaging trees where \(\bm x_i\) did not appear in training. These predictions are referred to as “out of bag” predictions. Cross validation is not required for out of bag predictions, since each prediction comes only from trees that never saw \(\bm x_i\) during training.
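Bootstrap resampling and the out of bag bookkeeping can be sketched directly (illustrative stdlib Python, not the randomForest implementation):

```python
# Each tree trains on a resample drawn with replacement; observations
# left out of that resample are "out of bag" (OOB) for the tree.
import random

def bootstrap_oob(n, rng):
    in_bag = [rng.randrange(n) for _ in range(n)]  # draw with replacement
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

rng = random.Random(42)
in_bag, oob = bootstrap_oob(100, rng)
assert len(in_bag) == 100
assert all(i not in in_bag for i in oob)
```

On average roughly a third of the observations (about \(e^{-1}\approx 36.8\%\)) land out of bag for any given tree, so most observations have many trees available for an OOB prediction.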

On pages 271-272 of The Elements of Statistical Learning, the authors show that the distribution of bootstrap means for infinite samples from a normal population is equivalent to the Bayesian posterior distribution with a non-informative prior. They also demonstrate that for the multinomial likelihood, the bootstrap distribution with infinite draws from the population approximates the Bayesian posterior assuming all outcomes are equiprobable a priori. Later, on pages 282-288, they show that the MSE of bagged estimates where samples are drawn from the population is less than or equal to the MSE of an individual model. For classification problems, it is shown that with infinite samples from the population, bagging can approximate the optimal classifier under certain conditions: the majority of classifiers fit to bootstrapped samples must predict the correct class more often than not, and the classes must be relatively balanced. Otherwise, bagging can make classifiers worse. These results concerning infinite samples are not directly applicable in practice, but they hint at the practical value of bagging.

The main limitation of bagging alone is correlation between models. If we average the results of two trees \(T_1,T_2\) for predictions, the variance of our estimates will be \(\frac{1}{4}(\mathbb{V}[T_1]+\mathbb{V}[T_2]+2\cdot\text{cov}(T_1, T_2))\). When only a few variables contain relevant information, we expect trees to make similar predictions. The random forests algorithm combats the correlation between trees by taking the same approach implemented by Ho in random decision forests. Random decision forests average multiple decision trees fit to the same data. The key feature of the algorithm is the random subspace method, where a random subset of the available predictors is drawn for each decision. Introducing randomness brings variety to the forest, resulting in weaker correlation between trees. The combination of random subspaces and bagging makes Breiman's random forest a flexible model with low variance. The hyperparameters for the random forest model are the number of trees to grow, the minimum and maximum number of observations allowed in a terminal node, and the number of predictors to select for each decision. In Figure 5 we compare the MSE of models fit to data simulated in the same way as the previous simulation study. The rpart and randomForest functions were used to fit the models.
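The variance identity above can be checked numerically (illustrative stdlib Python; the simulated draws and the construction of \(T_2\) as a correlated copy of \(T_1\) are invented for the example):

```python
# Check Var[(T1 + T2)/2] = (Var[T1] + Var[T2] + 2*Cov[T1, T2]) / 4
# on simulated draws; the identity holds exactly by bilinearity of
# the (sample) covariance.
import random

rng = random.Random(1)
t1 = [rng.gauss(0, 1) for _ in range(5000)]
t2 = [a + rng.gauss(0, 0.5) for a in t1]      # strongly correlated with t1

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

avg = [(a + b) / 2 for a, b in zip(t1, t2)]
lhs = cov(avg, avg)                            # Var of the averaged predictor
rhs = (cov(t1, t1) + cov(t2, t2) + 2 * cov(t1, t2)) / 4
assert abs(lhs - rhs) < 1e-8
```

When the covariance term is large, averaging buys little variance reduction, which is exactly the situation the random subspace method is designed to break up.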

Figure 5: Violin plots of MSE for tree and forest models on simulated data

We see considerably less variation in the mean squared error of random forests fit to the same data as individual decision trees, but single tree models generally outperformed a random forest. Random forests are designed for data with several covariates. With a single predictor, the random forest algorithm degenerates to bagging with out of bag predictions. Bagging reduces the variance of predictions at the cost of bias, and thus the “random forest” predictions were often worse than predictions from individual trees. Pages 600-601 of The Elements of Statistical Learning present a proof showing that the bias of a random forest is the same as the bias of any individual tree in the ensemble. The randomization in the random forest algorithm implies trees within the ensemble are almost certainly smaller than a single tree fit to the data. In cases where we have access to several predictors, large trees will create complex hierarchies that are unlikely to exist in the true model. The following simulation study shows the random forest algorithm consistently outperforming a single tree for data generated under a linear model with correlated covariates. Violin plots of MSE on testing data are displayed in Figure 6.

Figure 6: Violin plots of MSE for tree and forest models on linear model with several predictors

2.3 Bayesian CART

In the current and following section we discuss Bayesian methods for tree algorithms. Bayesian tree algorithms assume a stochastic process generates the tree, along with a distribution for the observations within the terminal nodes. Randomness is used to explore possible structures that may increase the likelihood. The flexibility of trees necessitates regularization of their structure, which can be easily achieved through the use of a prior. There are many ways to specify a Bayesian tree, but our discussion is limited to the Bayesian CART method proposed by Chipman, George, and McCulloch. In Bayesian CART, the search process begins with an initial stump tree and proposes a new structure for the tree at random. For the proposal, there are four possibilities. A change step is proposed with probability 0.4. In the change step, a non-terminal node is randomly selected and its splitting rule replaced. The new rule is chosen by randomly selecting a splitting variable \(X_j\), and then randomly selecting an observed value of \(\bm x_j\) as a cutoff. The same splitting rule cannot appear multiple times within the current tree. In the grow step of the search, a terminal node is randomly selected, a splitting rule is added, and a sibling for the chosen node is grown. A grow step is proposed with probability 0.25, and the decision rule is chosen in the same way as in the change step. The third possible move, referred to as pruning and proposed with probability 0.25, randomly selects an intermediate node and collapses all nodes beneath it. The last possible move, proposed with the remaining probability, is node swapping: a parent and child pair of intermediate nodes is selected and their rules are swapped. After updating the structure of the tree, new nodes are marked as splittable with probability \(\frac{\alpha}{1+\beta\gamma^d}\), where \(\alpha\in(0,1)\) and \(\beta,\gamma\in(0,\infty)\) are hyperparameters and \(d\) is the depth of the selected node.
Finally, terminal nodes are updated and the proposal is accepted or rejected. The authors describe methods for both independent and correlated terminal nodes. With the proposal tree constructed, the Metropolis-Hastings algorithm is used to update the sequence of trees.
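The move-selection part of the proposal step can be sketched as a categorical draw (illustrative stdlib Python; the swap probability of 0.10 is the remainder implied by the stated probabilities, and the tree manipulation itself is omitted):

```python
# Draw one of the four Bayesian CART structural moves with the stated
# proposal probabilities: change 0.40, grow 0.25, prune 0.25, swap 0.10.
import random

MOVES = [("change", 0.40), ("grow", 0.25), ("prune", 0.25), ("swap", 0.10)]

def propose_move(rng):
    u, acc = rng.random(), 0.0
    for move, p in MOVES:
        acc += p
        if u < acc:
            return move
    return MOVES[-1][0]   # guard against floating-point edge at u ~ 1.0

rng = random.Random(0)
draws = [propose_move(rng) for _ in range(10000)]
# Empirical frequency of the change move is close to its proposal probability.
assert abs(draws.count("change") / 10000 - 0.40) < 0.03
```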

Bayesian CART may become stuck around a local mode where the next tree in the sequence provides little to no improvement. To combat this, a sequence of trees is generated until the change in posterior probability between trees in the sequence stabilizes. The algorithm is then restarted to search for another local mode. Convergence is tracked by comparing the posterior probability of a tree to that of its predecessors, keeping track of which trees have the highest probability. The frequency of tree structures explored is ignored, making Bayesian CART a stochastic search as opposed to a Monte Carlo based method such as BART. Averaging trees that have converged to a local mode results in a model similar to random forests, but unlike random forests there is a distribution for the predictions.

2.4 Bayesian Additive Regression Trees (BART)

The other learning algorithm we explored was BART, or Bayesian Additive Regression Trees. For a continuous response, BART assumes \(y_i\sim N(\sum_{b=1}^m g(\bm x_i;T_b,M_b), \sigma^2)\) where \(T_b\) is the \(b\)'th tree, \(M_b\) are the associated terminal nodes, and \(g(\bm x_i;T_b,M_b)\) is the function that assigns \(\bm x_i\) to \(\mu_{lb}\), \(l\in\{1,2,...,|M_b|\}\). Unlike the random forest algorithm, the trees are not bagged and each tree has access to every feature. Additionally, the final model is based on the sum of all trees as opposed to an average. A key difference between the approach to growing trees in BART and Bayesian CART is that the BART tree structure is grown via MCMC as opposed to a Markov chain stochastic search. Using a highly informative prior encourages smaller trees, resulting in faster convergence and less variable predictions.

The algorithm starts by growing \(m\) trees with a single terminal node (stumps). Gibbs sampling is used to iteratively update each tree conditioned on all of the others. Conditional updates are simplified by expressing \(T_b|T_1,T_2,...T_{b-1},T_{b+1},...T_m\) as \(T_b|R_b\), where \(R_b\) denotes the partial residuals from predictions excluding \(T_b\). Summing trees grown conditionally on one another makes the contributions of individual nodes small. The authors state “… we can imagine the algorithm as analogous to sculpting a complex figure by adding and subtracting small dabs of clay.” This incremental sculpting process is achieved by using a highly informative prior, which also prevents any single tree from dominating the model.

BART is an adaptation of the authors' original Bayesian CART algorithm. With the smaller tree sizes, the pruning proposal is modified to only prune terminal nodes. Even with this modification, trees can still collapse to a stump throughout MCMC. Convergence is aided by the somewhat unrealistic assumption of independent priors for each tree and the standard deviation. The joint prior is given by \[p((T_1, M_1), (T_2, M_2),...,(T_m, M_m),\sigma^2)=p(\sigma^2)\prod_{b=1}^mp(M_b|T_b)p(T_b)\] where \(p(M_b|T_b)=\prod_{l=1}^{|M_b|}p(\mu_{lb}|T_b)\), and \(\sigma^2\sim\nu\lambda/\chi^2_\nu\) with hyperparameters \(\nu,\lambda\). The hyperparameters are selected by first choosing a point estimate \(\hat\sigma\). By default, this is the residual standard deviation from a multiple linear regression model. \(\nu\) is then fixed (\(\nu\in[3,10]\) is recommended) and \(\lambda\) is solved for by imposing the constraint \(\text{Pr}(\sigma<\hat\sigma)=q\), where \(q\) is an additional hyperparameter. In the prior for a tree \(p(T_b)\), the probability that a node at depth \(d\in\mathbb{N}\) is non-terminal is given by \(\alpha(1+d)^{-\beta}\), \(\alpha\in(0,1),\ \beta\in[0,\infty)\). A discrete uniform prior imposes the initial belief that each feature is equally likely to be selected for a node's splitting rule. Similarly, a discrete uniform across the observed values of the selected predictor serves as the prior for the cut point in the binary decision. A \(N(\mu_\mu,\sigma^2_\mu)\) prior is assumed for each \(\mu_{lb}|T_b\), so the sum of the \(m\) trees is a priori \(N(m\mu_\mu,m\sigma^2_\mu)\). The hyperparameters \(\mu_\mu\) and \(\sigma_\mu\) are chosen based on the data such that \(\min(Y)=m\mu_\mu-k\sqrt{m}\sigma_\mu\) and \(\max(Y)=m\mu_\mu+k\sqrt{m}\sigma_\mu\). The authors recommend choosing \(k\in[1,3]\). The BART R package rescales \(Y\) such that \(Y\in[-0.5, 0.5]\) and chooses \(\mu_\mu=0\), which implies \(\sigma_\mu=\frac{0.5}{k\sqrt{m}}\).
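The depth penalty in the tree prior and the induced \(\sigma_\mu\) can be evaluated directly (illustrative stdlib Python; \(\alpha=0.95\), \(\beta=2\), \(k=2\), and \(m=200\) are assumed example values, the first two being commonly used defaults):

```python
# The probability a node at depth d is non-terminal is alpha*(1+d)**(-beta):
# deeper nodes are increasingly likely to be leaves, keeping trees small.
import math

def p_nonterminal(d, alpha=0.95, beta=2.0):
    return alpha * (1 + d) ** (-beta)

# The prior decays monotonically with depth, discouraging deep trees.
probs = [p_nonterminal(d) for d in range(5)]
assert all(a > b for a, b in zip(probs, probs[1:]))

# Under the rescaled response with mu_mu = 0, sigma_mu = 0.5 / (k*sqrt(m)),
# so many trees imply a very tight prior on each terminal-node mean.
sigma_mu = 0.5 / (2 * math.sqrt(200))   # k = 2, m = 200 trees
assert sigma_mu < 0.02
```

This is the "small dabs of clay" mechanism in concrete terms: with 200 trees, each terminal-node mean is shrunk tightly toward zero, so no single tree can dominate the fit.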

Two differences between BART and traditional Bayesian models are that the number of trees \(m\) is fixed and that the data are used to inform the priors. These features push BART toward being a non-parametric method, but one with many of the features of Bayesian models. Inference is still derived from the likelihood, and hyperparameters can be tuned using a testing set with the goal of bringing prediction intervals to the nominal level. An example of a BART model tuned for nominal coverage on simulated data is displayed in Figure 7. The poor fit can likely be explained by non-additive noise in the generating function. BART is incredibly flexible, but at the end of the day it is still an additive model.

Figure 7: Visualization of BART fit to simulated data

So far we have focused on the ability of BART and the other algorithms to estimate functions, but ensembles of trees are not used exclusively for prediction. Two inferential tools commonly used with both random forests and BART are variable importance and variable interaction measures. We briefly describe these measures using the bartMan R package; our notes on the variable importance and variable interaction measures are based on its documentation. Similar measures have been useful for inference from random forests, but the calculations are different.

Variable importance can be measured by looking at the proportion of decisions in which a variable is used. For BART, the structure of the trees varies across iterations of MCMC, so we define \(c_{r,k}\) as the number of times variable \(r\) is used in a splitting decision within the \(k\)'th posterior sample. \(c_k=\sum_{r=1}^Pc_{r,k}\) is thus the total number of decisions in the \(k\)'th posterior sample. Variable importance is then measured as \(\text{VImp}_r=\frac{1}{K}\sum_{k=1}^K\frac{c_{r,k}}{c_k}\). Thinking of the extreme cases: with covariates that provide no information about the response, we should simply estimate \(f\) as the mean of the observed data, and BART will make few decisions as proposals provide little to no improvement to the likelihood. On the other hand, if we had some variable that completely explained our response, BART may consistently make decisions for each observed value of that feature.
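The importance calculation can be illustrated with toy counts (illustrative stdlib Python; the counts are invented for the example, and the importance is written as the split proportion averaged over posterior samples):

```python
# counts[k][r] = how often variable r is chosen as a splitting variable
# in posterior sample k (3 samples, 3 variables, 10 splits per sample).
counts = [
    [6, 3, 1],   # sample 1: variable 0 used in 6 of 10 splits, ...
    [5, 4, 1],
    [7, 2, 1],
]

def vimp(counts, r):
    K = len(counts)
    # proportion of splits using variable r, averaged over the K samples
    return sum(row[r] / sum(row) for row in counts) / K

imps = [vimp(counts, r) for r in range(3)]
assert abs(sum(imps) - 1.0) < 1e-9      # proportions sum to one
assert imps[0] > imps[1] > imps[2]      # variable 0 dominates the splits
```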

Similar to how variable importance is measured, we can assess the potential for interactions by calculating the proportion of successive decisions involving two variables across posterior samples. Individual decisions, or sequences of decisions, do not have much meaning on their own, making these algorithms hard to explain. It makes sense, though, that if a variable is consistently used, it must be able to explain the response to some degree. These tools can be used to interpret a BART model, or to help with variable selection for another model. Figure 8 displays a matrix of variable importance and interaction measures for data simulated as \(Y=X_1(X_2 + X_3+X_6)+X_2\cdot X_6+\sin(X_1)+\cos(X_2)+\log(|X_3|)+X_4^2+\lfloor X_5\rfloor+\lceil X_6\rceil+\epsilon\). We see all interactions are found, but the plot also suggests some evidence of an interaction between \(X_3\) and \(X_6\). The coefficient of variation is a measure used to compare one variable to another and does not have much meaning on its own. Like any other measure in statistics, the results must be scrutinized.

Figure 8: Coefficient of variation matrix for variable importance and interaction

Before we move on to the data, we briefly discuss diagnostics for BART. Unlike the random forest algorithm, a BART model for a continuous response assumes normally distributed errors and constant variance. When using a BART model, we need to assess the normality assumption, model fit, and convergence. The bartMan R package provides a convenient tool for diagnostics. Figure 9 shows the diagnostics for the same model used for Figure 8. The Q-Q plot and histogram both indicate normality was reasonable, as expected. The fitted vs residuals plot indicates constant variance is reasonable as well. The actual vs fitted plot shows BART fits the data well, as the observed and fitted values are quite close. The variable importance intervals show BART correctly identified all variables as important, as none of the intervals contain zero. The trace plot for sigma shows the standard deviation has converged reasonably well, but we would most likely run the sampler longer in a real analysis. Additional convergence assessments should be performed. BART models have many parameters, and it is not unheard of to avoid assessing the convergence of each tree: the reasoning is that we don't care about any individual tree, so as long as our predictions have converged the model fits reasonably well. Changing the number of trees, along with \(\alpha\) and \(\beta\), can help with convergence.

Figure 9: Diagnostic plots for BART

3 Data

Before we can apply any of the models discussed to assess the importance of land cover on flying fox feeding locations, we need to extract and transform the data. The flying fox locations themselves form a point pattern, which can’t be modelled directly by any of the algorithms discussed. We avoid modelling a point pattern by aggregating counts over a grid to create areal data. In this section we discuss the tools used to obtain and process our data.

3.1 Raster Data and Google Earth Engine

Rasters are used to store data with spatial components. Each pixel has an \((x,y)\) coordinate and a variable of interest. We can visualize rasters by mapping the variable to a color, with gradient scales for continuous values. For this project, we use rasters pulled via Google Earth Engine (GEE). GEE hosts a large database of satellite images and rasters of estimated attributes. Many Earth Engine datasets contain images describing different attributes of the land; these attributes are called bands.

An important component of spatial rasters is the coordinate reference system (CRS). We can think of the satellite images as pictures of a sphere. Using the position of the camera and other factors, such as the distance to the earth, \((x,y,z)\) coordinates can be assigned to each pixel. For spatial data, we need to know how far apart locations are on the sphere, but looking at a picture of a sphere, we cannot determine distances between points because of the curvature. Gauss's Theorema Egregium (how to cite?) implies that no 2D map perfectly preserves the distances between points. We therefore need to choose a CRS that preserves distances reasonably well for our area of interest; the location on the earth and the size of the area determine this choice. Each raster has a pre-specified CRS that may or may not be suitable, but we can always project the raster into another CRS.
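Reprojection can be sketched with the raster package as below. The toy raster, its extent, and the target EPSG code 28356 (GDA94 / MGA zone 56, a metric CRS covering eastern Australia) are illustrative choices, not necessarily those used in our analysis.

```r
library(raster)

# Toy raster in longitude/latitude (WGS84) over eastern Australia
r <- raster(nrows = 10, ncols = 10,
            xmn = 152, xmx = 153, ymn = -28, ymx = -27,
            crs = "+proj=longlat +datum=WGS84")
values(r) <- runif(ncell(r))

# Project into a metric CRS suited to the area of interest
# (EPSG:28356 is an illustrative choice for eastern Australia)
r_proj <- projectRaster(r, crs = "+init=epsg:28356")
```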

Working with data from Earth Engine, we can select either an individual image or an image collection. An image corresponds to a specific band and time, while an image collection is the entire dataset. In either case, we have access to every available time and location. With an image, we can clip the area of interest, masking data outside this region. With an image collection, we can filter the dataset to a specific region and time period; after selecting a band, an individual image can be chosen, or an image can be produced from a calculation over the filtered time period and region. Images can then be exported to Google Drive and loaded into R with the raster package. After loading the raster, we can create a long-formatted data structure for modelling.
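The loading and reshaping step can be sketched as follows; the GeoTIFF file name is hypothetical, and a small in-memory raster is substituted so the snippet is self-contained.

```r
library(raster)

# In a real analysis we would load an exported GeoTIFF, e.g.
# r <- raster("dynamic_world_export.tif")   # hypothetical file name
r <- raster(nrows = 5, ncols = 5, vals = runif(25))

# Long format: one row per pixel, with x, y, and the band value
rast_long <- as.data.frame(r, xy = TRUE)
head(rast_long)
```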

3.2 Land Cover and Dynamic World

One category of datasets on Google Earth Engine is land cover, which describes natural and developed features of the earth as classes. A similar but distinct category is land use, which describes how humans use the land. There are various methods to estimate land cover, and different models estimate different classes; datasets also span various regions and time periods. The majority of land cover datasets are annual and not fully up to date, with estimates produced in a multi-phase modelling process that considers many images along with external prior information. For recent data from Australia, there are fewer options, especially after excluding low-resolution datasets. In this project, we use data from Dynamic World V1. Dynamic World is a daily, high-resolution land cover dataset derived from Sentinel-2 satellites; the estimated probability of each class is the output of a convolutional neural network's predictions on a single satellite image.

Within Dynamic World there are ten bands, each containing a collection of images. In our analysis, we use the eight relevant bands, each containing the estimated probability that a \(100\ m^2\) region of the earth corresponds to the chosen class. Unsurprisingly, daily high-resolution data is rife with missingness. To combat this, we compute the median of each class probability over a selected period of time, with missing values ignored in the calculation by default. This also mitigates the fact that Dynamic World estimates are based on an individual day: weather and atmospheric factors may influence the estimates, so relying on the single closest day may not be advisable, and since we do not expect major changes to land cover over a short period of time, the median is reasonable. An individual image can then be produced from this calculation and exported. A panel of the exported rasters is displayed in @dw_lc.
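The filter-then-median step can be sketched from R with the rgee package, which wraps the Earth Engine API. The band, region, date range, and export settings below are illustrative, and the snippet is not runnable without Google Earth Engine credentials.

```r
library(rgee)
ee_Initialize()  # requires Google Earth Engine credentials

# Illustrative region and time window
region <- ee$Geometry$Rectangle(c(152, -28, 153, -27))

# Median composite of the 'water' probability band;
# missing pixels are ignored by the median reducer
dw <- ee$ImageCollection("GOOGLE/DYNAMICWORLD/V1")$
  filterBounds(region)$
  filterDate("2021-01-01", "2021-03-01")$
  select("water")$
  median()

# Export the composite image to Google Drive
task <- ee_image_to_drive(dw, region = region, scale = 10)
task$start()
```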

Dynamic World Exports from Google Earth Engine

3.3 Areal Data

https://gis.stackexchange.com/questions/323698/counting-points-in-polygons-with-sf-package-of-r
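The aggregation from a point pattern to areal counts can be sketched with the sf package: build a regular grid over the points and count how many fall in each cell. The point data and grid size below are simulated placeholders for the flying fox locations.

```r
library(sf)

# Simulated stand-in for the flying fox feeding locations
set.seed(1)
pts <- st_as_sf(data.frame(x = runif(100), y = runif(100)),
                coords = c("x", "y"))

# Regular grid over the bounding box of the points
grid_sf <- st_sf(geometry = st_make_grid(pts, n = c(10, 10)))

# Count the points falling in each grid cell
grid_sf$count <- lengths(st_intersects(grid_sf, pts))
```

The resulting `grid_sf` has one row per cell with a `count` column, matching the object plotted below.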

library(ggplot2)

# Choropleth of the areal data: each grid cell is filled by the
# number of flying fox locations it contains
ggplot(grid_sf) +
  geom_sf(aes(fill = count), color = "white") +
  scale_fill_viridis_c() +
  theme_minimal() +
  labs(fill = "Bat Count")

# Overlay the water probability raster, the hull of the study
# region, and the observed flying fox locations
ggplot() +
  geom_raster(aes(x = x, y = y, fill = water), data = rast_dat) +
  scale_fill_viridis_c() +
  ggnewscale::new_scale_fill() +  # second fill scale for the polygon layer
  geom_polygon(aes(x = x, y = y), fill = NA, colour = "black", data = hull_dat) +
  geom_point(aes(x = Long, y = Lat), color = "red", data = extracted_wide,
             alpha = 0.2)

4 Modelling and Results

5 Conclusion